where k denotes the full-precision kernels, w the reconstructed matrix, v the variance of y, μ the mean of the kernels, Ψ the covariance of the kernels, f_m the features of class m, and c the mean of f_m.

Zheng et al. [288] define a new quantization loss between the binary weights and the learned real values, and theoretically prove the necessity of minimizing the weight quantization loss. Ding et al. [56] propose a distribution loss to explicitly regularize the activation flow and develop a framework to formulate the loss systematically. Empirical results show that the proposed distribution loss is robust to the choice of training hyper-parameters.

In summary, these methods all aim to minimize the error and information loss caused by quantization, which improves the compactness and capacity of 1-bit CNNs.
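As a rough illustration of this family of objectives, the sketch below adds an auxiliary penalty on the gap between real-valued weights and their scaled binary projection to the task loss; the per-tensor scale and the weighting factor lam are illustrative assumptions, not the exact formulations of Zheng et al. or Ding et al.

```python
import torch

def quantization_loss(weights, lam=1e-4):
    """Illustrative quantization loss: penalize the distance between
    real-valued weights and their scaled binary projection.
    The per-tensor scale and `lam` are assumptions for illustration,
    not the exact loss of Zheng et al. [288] or Ding et al. [56]."""
    loss = 0.0
    for w in weights:
        alpha = w.abs().mean()          # per-tensor scaling factor
        w_bin = alpha * torch.sign(w)   # binary reconstruction of w
        loss = loss + (w - w_bin).pow(2).mean()
    return lam * loss

# usage (hypothetical): total_loss = task_loss + quantization_loss(model.parameters())
```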

1.1.6 Neural Architecture Search

Neural architecture search (NAS) has attracted significant attention thanks to its remarkable performance in various deep learning tasks. Impressive results have been reported for reinforcement-learning-based search, for example [306]. Recent methods such as differentiable architecture search (DARTS) [151] reduce the search time by formulating the task in a differentiable manner. To reduce redundancy in the network space, partially connected DARTS (PC-DARTS) was recently introduced to perform a more efficient search without compromising DARTS performance [265].
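For context, differentiable NAS methods such as DARTS relax the discrete choice among candidate operations on an edge into a softmax-weighted mixture. The sketch below shows this relaxation in simplified form; the candidate operation set and module structure are illustrative, not the released DARTS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted mixture over candidate operations on one edge,
    as in differentiable NAS (simplified; the candidate set is illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # architecture parameters, learned alongside the network weights
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

After the search, only the strongest operation on each edge is retained to form the final discrete architecture.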

In Binarized Neural Architecture Search (BNAS) [35], neural architecture search is used to search for BNNs, and the BNNs obtained by BNAS can outperform conventional models by a large margin. Another natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS, taking advantage of the strengths of each in a unified framework [304]. To accomplish this, a Child-Parent (CP) model is introduced into differentiable NAS to search for the binarized architecture (Child) under the supervision of a full-precision model (Parent). In the search stage, the Child-Parent model uses an indicator generated from the accuracy of the Child and Parent models to evaluate the performance of candidate operations and abandon those with less potential. In the training stage, a kernel-level CP loss is introduced to optimize the binarized network. Extensive experiments demonstrate that the proposed CP-NAS achieves accuracy comparable to traditional NAS on both the CIFAR and ImageNet databases.
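As a loose sketch of the Child-Parent supervision idea (not the exact CP-NAS objective), the Child can be trained with a task loss plus terms pulling its outputs and its binarized kernels toward the full-precision Parent; the particular distance measures and the weight beta below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cp_loss(child_logits, parent_logits, child_kernels, parent_kernels,
            labels, beta=0.1):
    """Illustrative Child-Parent objective: task loss on the Child,
    a distillation term toward the Parent's outputs, and a kernel-level
    term pulling the binarized Child kernels toward the Parent's kernels.
    The specific terms and `beta` are assumptions, not the exact CP-NAS loss;
    Child and Parent kernels are assumed to share shapes."""
    task = F.cross_entropy(child_logits, labels)
    distill = F.kl_div(F.log_softmax(child_logits, dim=1),
                       F.softmax(parent_logits, dim=1),
                       reduction="batchmean")
    kernel = sum((c - p).pow(2).mean()
                 for c, p in zip(child_kernels, parent_kernels))
    return task + distill + beta * kernel
```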

Unlike conventional convolutions, BNAS is achieved by transforming all convolutions in the search space O into binarized convolutions. The full-precision and binarized kernels are denoted as X and X̂, respectively. A convolution operation in O is represented as B_j = B_i ⊗ X̂, where ⊗ denotes convolution. To build BNAS, a key step is to binarize the kernels from X to X̂, which can be implemented based on state-of-the-art BNNs such as XNOR-Net or PCNN. To further reduce the search cost, channel sampling and a reduction of the operation space are introduced into differentiable NAS, significantly cutting the number of GPU hours and leading to an efficient BNAS.
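A minimal sketch of the binarization step, assuming an XNOR-Net-style scaled sign projection for X̂ and a simple random channel-sampling helper to suggest how the search cost can be cut; the function names and sampling ratio are illustrative, not the exact BNAS implementation.

```python
import torch
import torch.nn.functional as F

def binarize_kernel(x):
    """XNOR-style binarization: sign of the kernel with a per-output-channel
    scaling factor (one common choice; BNAS may use other BNN schemes such as PCNN)."""
    alpha = x.abs().mean(dim=(1, 2, 3), keepdim=True)
    return alpha * torch.sign(x)

def binary_conv(feat, kernel, stride=1, padding=1):
    """Convolution with binarized kernels: B_j = B_i (conv) X_hat."""
    return F.conv2d(feat, binarize_kernel(kernel), stride=stride, padding=padding)

def sample_channels(feat, ratio=0.25):
    """Illustrative channel sampling: operate on a random subset of channels
    during the search to reduce memory and GPU hours."""
    c = feat.size(1)
    idx = torch.randperm(c)[: max(1, int(c * ratio))]
    return feat[:, idx], idx
```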

1.1.7 Optimization

Researchers have also explored new training methods to improve BNN performance. These methods are designed to address the drawbacks of BNNs. Some borrow popular techniques from other fields and integrate them into BNNs, while others modify classical BNN training, for example by improving the optimizer.

Sari et al. [234] find that the BatchNorm layer plays a significant role in avoiding exploding gradients, so the standard initialization methods developed for full-precision networks are irrelevant for BNNs. They also break down BatchNorm components into centering and